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Abstract 

In the past two decades, some major efforts have been made to reduce ex- 
act (e.g. integer, rational, polynomial) linear algebra problems to matrix 
multiplication in order to provide algorithms with optimal asymptotic 
complexity. To provide efficient implementations of such algorithms one 
need to be careful with the underlying arithmetic. It is well known that 
modular techniques such as the Chinese remainder algorithm or the p- 
adic lifting allow very good practical performance, especially when word 
size arithmetic are used. Therefore, finite field arithmetic becomes an 
important core for efficient exact linear algebra libraries. In this paper, 
we study high performance implementations of basic linear algebra rou- 
tines over word size prime fields: specially the matrix multiplication; our 
goal being to provide an exact alternate to the numerical BLAS library. 
We show that this is made possible by a careful combination of numeri- 
cal computations and asymptotically faster algorithms. Our kernel has 
several symbolic linear algebra applications enabled by diverse matrix 
multiplication reductions: symbolic triangularization, system solving, 
determinant and matrix inverse implementations are thus studied. 
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1 Introduction 

Finite fields play a crucial role in computational algebra. Indeed, finite fields are 
the basic representation used to solve many integer problems. The whole solu- 
tions are then gathered via the Chinese remainders or lifted p-adically. Among 
those problems are integer polynomial factorization [47] , integer system solving 
[9, 44], integer matrix normal forms [23] or integer determinant [34]. Finite 
fields are of intrinsic use in polynomial linear algebra [26] but also in cryptology 
(e.g. large integer factorization [38], discrete logarithm computations [40]) or 
for error correcting codes. Moreover, nearly all of these problems involve linear 
algebra resolutions. Therefore, a fundamental issue is to implement efficient el- 
ementary arithmetic operations and very fast linear algebra routines over finite 
fields. 

We propose a way to implement the equivalent of the basic BLAS level 1,2, 
and 3 numerical routines (respectively dot product, matrix- vector product and 
matrix-matrix product), but over finite fields. We will focus on implementations 
over fields with small cardinality, namely not exceeding machine word size, but 
with any characteristic (consequently, we do not deal with optimizations for 
powers of 2 cardinalities). For instance, we show that symbolic matrix multi- 
plication can be as fast as numerical matrix multiplication (see section 3) when 
using word size prime fields. Our aim is not to rebuild some specialized routines 
for each field instance. Instead, the main idea is to use a very efficient and au- 
tomatically tuned numerical library as a kernel (e.g. ATLAS [46]) and to make 
some conversions in order to perform an exact matrix multiplication (i.e. with- 
out any loss of precision). The efficiency will be reached by performing as few 
conversions as possible. Several alternatives to this approach exist: one would 
be to implement a core linear algebra with integer arithmetic. Unfortunately, 
new architectures focus on numerical arithmetic and therefore by using integer 
arithmetic we would lose a factor of 2 or 4 due to the SIMD (single instruction, 
multiple data) SSE speed-up of the numerical routines. Note that SSE4 with 
some integer support is announced for 2008 and might then change some of this 
point of view. Anyway, another feature of our approach is to rely on a large 
community of effort for the numerical handling of linear algebra routines. We 
want to show in this paper that no real gain could be obtained by trying to 
mimic their effort over just using it. 
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Then, building on this fast numerical blocks, we can use fast matrix multi- 
plication algorithms, such as Strassen's or Winograd's variant [24, §12]. There, 
we use exact computation on a higher level and therefore do not surfer from 
instability problems [30]. 

Many algorithms have been designed to use matrix multiplication in order 
to be able to prove an optimal theoretical complexity. In practice those exact 
algorithms are only seldom used. This is the case, for example, in many linear 
algebra problems such as determinant, rank, inverse, system solution or minimal 
and characteristic polynomial. We believe that with our kernel, each one of those 
optimal complexity algorithms can also be the most efficient. One goal of this 
paper is then to show the actual effectiveness of this belief. In particular we 
focus on factorization of matrices of any shape and any rank. 

Some of the ideas from preliminary versions of this paper [17], in particular 
the BLAS-based matrix multiplication for small prime fields, are now incorpo- 
rated into the Maple computer algebra system since its version 8 and also into 
the 2005 version of the computer algebra system Magma. Therefore an effort 
towards effective reduction has been made [18] in C++ and within Maple by 
A. Storjohann[6]. Effective reduction for minimal and characteristic polynomial 
were proposed in [20] and A. Steel has reported on similar efforts within his 
implementation of some Magma routines. 

In this paper, the matrix factorization, namely the exact equivalent of the 
LU factorization is thus extensively studied. Indeed, unlike numerical matrices, 
exact matrices are very often singular, even more so if the matrix is not square ! 
Consequently, Ibarra, Moran and Hui have developed generalizations of the LU 
factorization, namely the LSP and LQUP factorizations [33]. Then we adapt 
this scheme to rank, determinant, inverse (classical or Moore-Penrose), nullspace 
computations, etc. There, we will give not only the asymptotic complexity 
measures but the constant factor of the dominant term. Most of these terms 
will give some constant factor to the multiplication time and we will compare 
those theoretical ratios to the efficiency that we achieve in practice. This will 
enable us to give a measure of the effectiveness of our reductions (see especially 
section 6). 

Now, we provide a full C++ package available directly [13] or through the 
exact linear algebra library LinBox 1 [16]. Extending the work undertaken by 
the authors et al.[41, 17, 4, 25, 14, 18, 20], this paper focuses on matrix multi- 
plication with an extended Winograd variant optimizing memory allocation ; on 
simultaneous triangular system solving; on matrix factorization and improved 
constant factors of complexity for many linear algebra equivalent routines (in- 
verse, squaring, upper-lower or upper-upper triangular multiplication, etc.). 

The paper is organized as follows. Section 2 introduces some material for the 
evaluation of arithmetical costs of recursive algorithms; we also motivate our 
choice to represent elements of a finite field; Then section 3 presents efficient 
ways to implement matrix multiplication over generic prime fields, including a 
study of fast matrix multiplication. Section 4 deals with the matrix multipli- 

www . 1 inalg . org 
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cation based simultaneous resolution of n triangular systems. Laslty, section 5 
presents implementations of several matrix factorizations and their applications 
with a study of complexity and of efficiency in practice. 

2 Preliminaries 

2.1 Finite field arithmetic 

The first task, to implement exact linear algebra routines, is to develop the un- 
derlying arithmetic. Indeed, any finite field, except GF(2), do not map directly 
to the arithmetical units of nowadays processors and a software emulation is 
therefore mandatory. This has been well studied in literature, and we refer to 
[14] and references therein for a survey on this topic. Here, we recall the differ- 
ent ways of implementing such arithmetic and we will motivate our choice of a 
particular one for efficient linear algebra routines. 

2.1.1 Implementations 

Representation of finite fields elements plays a crucial role in the efficiency of 
arithmetic operations. From now on, we will count arithmetic operations in 
terms of field operations, that is we will count addition, subtraction, multipli- 
cation and division in the arithmetic complexity results. 

A usual way to implement prime fields arithmetic is to map the elements of 
the field to integers modulo a prime number, defined by its characteristic. From 
now on, we will focus on prime fields with characteristic no greater than a word 
size (e.g. 32 bits). In this basic case, various representations and arithmetics 
can be used: 

• Classical representation with integer divisions. 

Integers between and p — 1 or between (1 — p)/2 and (p — l)/2 are 
used; additive group operations are done with machine integers operations 
followed by a test and a correction; multiplication is followed by machine 
remaindering while division is performed via the extended gcd algorithm. 

• Montgomery representation. 

This representation, proposed in [37], allows to avoid costly machine re- 
maindering within the multiplication. A shifted representation is used and 
remaindering is replaced by multiplications. Note that others operations, 
except the division, stay identical. 

• Floating point inverse. 

Another idea to reduce remaindering cost in multiplication is to precom- 
putc the inverse of the characteristic p within a floating point number. 
Therefore, only two floating point multiplications and some rounding are 
necessary. However, floating point rounding may induce a ±1 error and 
then an adjustment is required, as implemented in Shoup's NTL library 
[43]. 
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• Discrete logarithm (also called Zech logarithm). 

Here, elements are seen as a power of a generator of the multiplicative 
group, namely a primitive element. As a consequence, multiplicative group 
operations can be performed only by addition or subtraction modulo p — 
1. Nevertheless, this representation makes the addition/subtraction more 
complicated in the field. In particular, these operations need some table 
lookup; see [14, §2.4]. 

Extension fields, denoted GF{p k ) 1 arc usually implemented via polynomials 
over the prime field Z/pZ modulo an irreducible polynomial of degree k. Thus, 
operations in the extension reduce to polynomial arithmetic. An alternative 
is to tabulate entries and use the Zech logarithm representation also. As for 
prime fields, some representations can be used to avoid the costly remaindering 
phase within the multiplication. We will not discuss any implementations over 
extension field in this paper. We let the reader refer to [15] for details on 
data structures, arithmetic and matrix multiplication over small extension fields. 
From now on, when we will refer to finite fields this will mean word-size prime 
fields and the extensions for which the trick of [17, §4] is usable. 

2.1.2 Ring homomoprphism and delayed reduction 

As a primitive tool for implementing linear algebra routines, the efficiency of the 
finite field representation needs to be well studied. In [14] the author analyzes 
the efficiency of finite field arithmetic according to a chosen representation. It 
has been shown that atomic operations (e.g. addition, multiplication) can be 
performed more efficiently than with the classic method depending on the ar- 
chitecture. In particular, it appears that memory access based implementations 
(i.e. discrete logarithm) and floating point based implementations (i.e. floating 
point inverse) are more efficient on older architecture such as Ultra Sparc. Nev- 
ertheless, with newer architecture such as Pentium III and Pentium 4, integer 
machine operations become more efficient and outperform other implementa- 
tions, except discrete logarithm for multiplicative group operations. 

However, for linear algebra, the primary operation is the succession of two 
operations: a multiplication followed by an addition; this operation is commonly 
called AXPY (also "fused- mac" or FMA within hardware). This operation 
clearly influences the efficiency of vectors dot product which is one of the main 
operations of classic linear algebra. However, optimized atomic operation 

is deprecated since one would rather use delayed divisions. This technique 
consists in successive multiplications and accumulations without any division. 
Divisions intervene either just before an overflow occurs within the hardware 
data, or only after a fixed numbers of accumulations. 

Indeed, any prime field Z p can be naturally embedded into Z by representing 
its elements with an integer of an interval [m, M], such that M — m = p — 1. 
The reverse conversion consists in applying a reduction modulo p to the integer 
value. 

The ring structure being preserved by these homomorphisms, any ring algo- 
rithm over Z p can be transposed into a ring algorithm over Z. 



G 



Prime Field Linear Algebra 



J-G. dumas, P. Giorgi, C. Fernet 



Now the machine integer arithmetic uses a fixed number of bits 7 for the 
integer representation: 7 = 32 for int, 7 = 24 (rcsp. 7 = 53) for single (resp. 
double) precision floating point values, etc. 

Using this approximate integer arithmetic, one has therefore to ensure that 
the computation of the integer algorithm will not overflow the representation. 
Hence for each integer algorithm, a bound on the maximal computed value has 
to be given, depending on m and M. 

For example, if the representation is interval is [0,p — 1], one can perform A 
accumulations without any divisions if 

A(p-l) 2 <2^<(A + l)(p-l) 2 (1) 

Note that if signed words are available, a centered representation can be used 
(i.e. — 2^- < x < for the storage of an element x of the odd prime field) 
and the equation 1 becomes 

» (^) ! < ^< (1 + A ,(^i) 2 (2) 

which improves A by a factor of 2. 

Hence, the bottleneck of divisions can be amortized since only divisions 
will occur in a n-dimensional vector dotproduct. 

Contrary to atomic operations, floating point based implementations for 
dotproduct tend to be the most efficient on average. In particular, timings 
are constant and achieve almost half of the peak of arithmetical unit while the 
timings of others implementations drop as soon as the size of the finite field 
increases. However, when small primes are used, one can improve these timings 
to almost the peak of the machine by using others implementations [14, §3.4]. 

According to these results and the necessity of genericity, we provide imple- 
mentations based on generic finite fields (e.g. use of C+ + template mechanism). 
However, in this paper, we mainly use a floating point based implementation 
for our finite fields arithmetic, called Zpz-double. This choice is principally 
motivated by the use of optimized numerical basic linear algebra operations 
through the BLAS library. Indeed, one can easily benefit from these libraries by 
simply mapping linear algebra operations over finite fields to numeric computa- 
tions and delayed divisions. This will be extensively explained in sections 3 and 
4. Therefore, the choice of floating point based representations for finite field 
elements will be an asset since it will avoid any data conversion. Possibly, we 
may use a different finite field implementation in order to compare efficiencies. 
There, we will use the notation Zpz-int, meaning a word size integer based im- 
plementation. As we will see throughout the rest of the paper, the combination 
of BLAS and Zpz-double implementation will allow us to approach numerical 
efficiency for linear algebra problems over finite fields. 

2.2 Recursion materials for arithmetical complexity 

The following two lemmas will be useful to study the constant factor of linear 
algebra algorithms compared to matrix multiplication. The first one gives the 
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order of magnitude when the involved matrices will be square: 
Lemma 2.1. Let m be a positive integer and suppose that 

1. T(m) = CT{ 7 -j) + am u + e(m), with e(m) < 3m 2 for some constants 
C,a,u,g. 

2. T(l) = e for some constant e. 

3. log 2 (C) <uj. 
Then T{m) = 0(m w ). 

Proof. Let t = log 2 (m). The recursion gives, 

T(m) = C*T(1) + a m tf l "f + £ Ce®. 

1 _ 2^ i= o 2 

Then, on the one hand, if C ^ 4 this yields T(m) = ^^m^ + kC l + g'm 2 , 
where 5' < and k < T(l) - 2 ° 2 _" g , - 5'. On the other hand, when C = 4, 
we have T(m) = a^m" + fc'C + jm 2 log 2 (m), where fc' < T(l) - In 
both cases, with C* = m lo &( c ), this gives T(m) = ^r^m^ + o(m w ). □ 

Now we give the order of magnitude when the matrix dimensions differ: 

Lemma 2.2. Let m and n be two positive integers and suppose that 

1. T(m,n) = Ei=i c i T (T' n_ d *f ) + amU + bm u> ~ 1 n + e(m,n), with C = 
J2i=i c i> D = c idi, 2 < ijJ and e(m, n) < gm 2 + hmn . 

2. T(1,F) < eF for a constant e. 

3. log 2 (C) <w-l 

Then T(m, n) = C(m w + m^n). 

Proof. As in the preceding lemma, we use the recursion and geometric sums to 
get 

k k 

T(m, n) = ^ c n ■ ■ ■ X] Ci * T ( 1 ' n _ • • • ' dt > m ))+ 

ii = l i t =l 
A; fe 

+ c n H(m/2,n- d l m/2) . . . + ^ c ix . . . ^ c it H(l,n - f(d 1: . . .,d t ,m)) 

ii = l ii = l it = l 

(3) 
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Thus, we get 

t 

am" + 0m u - 1 n < T(m, n) < am" + prrT^n + C*T(1, n) + CVff(^, n). 

i=i 21 

The last term is bounded by gm 2 _ \J + fmn _c when 4 and C^2. 

4 1— "2 

In this case C*T(1, rc) + £* =1 Cff , n) < m Io s^ c ) ((e + + = 

0(m" + m"~ x n). When C = 2, a supplementary log 2 (m) factor arises in the 
small factors, but the order of magnitude is preserved since log 2 (C) + 1 = 2 < 
uj. □ 

These two lemmas are useful in the following sections where we solve (e.g. suppose 
T(m) — am" in a recurring relation for a) to get the actual constant of the 
dominant term. Thus, when we give an equality on complexities, this equality 
means that the dominant terms of both complexities are equal. In particular, 
some lower order terms may differ. 

3 Matrix multiplication 

We propose a design for a matrix multiplication kernel routine over a word-size 
finite field, based on the three following features: 

1. delayed modular redution, as explained section 2.1.2, 

2. cache tuning and floating point arithmetic optimizations using BLAS, 

3. Strassen-Winograd fast algorithm. 

3.1 Cache tuning using BLAS 

In most of the modern computer architecture, a memory access to the RAM is 
more than one hundred times slower than an arithmetic operation. To circum- 
vent this slowdown, the memory is structured into two or three levels of cache 
acting as buffers to reduce the number of accesses to the RAM and reuse as 
much as possible the buffered data. This approach is only valid if the algorithm 
involves many computations with local data. 

In linear algebra, matrix multiplication is the better suited operation for 
cache optimization: it is the first basic operation, for which the time complexity 
0(n 3 ) is an order of magnitude higher than the space complexity 0(n 2 ). Fur- 
thermore it plays such a central role in linear algebra, that every other algorithm 
will take advantage of the tuning of this kernel routine. 

These considerations have driven the development of basic linear algebra 
subroutines (BLAS) [11, 46] for numeric computations. One of its main achieve- 
ment is the level 3 set of routines, based on a highly tuned matrix multiplication 
kernel. 
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For computations on a word-size finite field, a similar approach could be 
developed, e.g. following [29] for block decomposition. Instead, we propose 
to simply wrap these numerical routines to form the integer algorithm of the 
delayed modular approach of the previous section. This will enable to take 
benefit from both the efficiency of the floating point arithmetic and the cache 
tuning of the BLAS libraries. Furthermore relying on the generic BLAS inter- 
face makes it possible to benefit from the large variety of optimizations for all 
existing architectures and ensures a long term efficiency thanks to the much 
larger development effort existing for numerical computations. 




Figure 1: Blocking classical matrix multiplication, on a Xeon, 3.6GHz. 



Figure 1 shows the advantage of this method (FFLAS : : classic) compared to 
two other implementations: the naive algorithm (long-noblock), and a hand- 
made cache tuned implementation, based on block decomposition of the input 
matrices, so that each block product could be performed locally in the L2 cache 
memory (long-block-40, for a block dimension 40). The graph compares the 
computation speed in millions of field operations per seconds (Mfops) for differ- 
ent matrix orders. As a comparison we also provide the computation speed of 
the equivalent numerical BLAS routine dgemm. This approach improves on the 
efficiency of the two other methods over a finite field and the overhead of the 
modular reductions is limited. Finally, the (FFLAS : : f gemm) implementation is 
the most efficient thanks to the combination of numerical computations and a 
fast matrix multiplication algorithm which is discussed in the next section. 

3.2 Winograd fast algorithm 

The third feature of this kernel is the use of a fast matrix multiplication algo- 
rithm. We will focus on Winograd's variant [24, algorithm 12.1] of Strassen's 
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algorithm [45]. We denote by MM(n) the dominant term of the arithmetic 
complexity of the matrix multiplication. The value of MM(ti) thus reflects the 
choice of algorithm, e.g. MM(n) = 2n 3 for the classical algorithm, and mean 
that the actual complexity of the classical algorithm is 2n 3 + 0(n 2 ). We also 
denote by u> the asymptotic exponent of MM(n), it is thus 3 for the classical 
algorithm, log 2 (7) ~ 2.807354922 for the Strassen-Winograd variant, and the 
best known exponent is about 2.375477 by [7]. 

In [30] Winograd's variant is discarded for numerical computations because 
of its bad stability and despite its better running time. In [35] aggregation- 
cancellation techniques of [36] are also compared. They also give better stability 
than the Winograd variant but worse running time. For exact computation, 
stability is no longer an issue and Winograd's faster variant is thus preferred. 

3.2.1 A Cascade structure 

Asymptotically, this algorithm improves on the number of arithmetic operations 
required for matrix multiplication from MM(n) = 2n 3 to MM(n) = 6n 2 ' 8074 . 
But for a given n, the total number of arithmetic operations can be reduced 
by switching after a few recursive levels of Winograd's algorithm to the classic 
algorithm. Table 1 compares the number of arithmetic operations depending on 
the matrix order and the number of recursive levels. 



Recursive levels of Winograd's algorithm 



n 


Classic 


1 


2 


3 


4 


5 


6 


4 


112 


144 


214 










8 


960 


1024 


1248 


1738 








16 


7936 


7680 


8128 


9696 


13126 






32 


64512 


59392 


57600 


60736 


71712 


95722 




64 


520192 


466944 


431104 


418560 


440512 


517344 


685414 



Table 1: Number of arithmetic operations in the multiplication of two n x n 
matrices 



This phenomenon is amplified by the fact that additions in classic matrix 
multiplication are cheaper than the ones in Winograd algorithm since they take 
advantage of the cache optimization of the BLAS routine. As a consequence, 
the optimal number of recursive levels depends on the architecture and must 
be determined experimentally It can be described by a simple parameter: the 
matrix order w for which one recursive level is as fast the classic algorithm. 
Then the number of levels I is given by the formula 



I = 



log 2 



n 



W- 



+ 1. 
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3.2.2 Schedule of the algorithm 

We based our implementation of Winograd's algorithm on two different sched- 
ules. For the operation C <— A x B we use that of [12, Fig. 1] and for the 
extended C <— oeAx B + f3C, that of [32, Fig. 6] that we recall in table 2. More 
details about tasks scheduling and memory efficient variants of Winograd's al- 
gorithm can be found in [21]. 



# 


operation 


loc. 


# 


operation 


loc. 


1 


S\ = A21 + A22 


Xi 


12 


S4 = A12 — S2 


Xi 


2 


T\ = B\2 — B\\ 


x 2 


13 


T4 = T2 — B 2 i 


x 2 


3 


P5 = a.S\T\ 


x 3 


14 


C\2 = aSiB 22 + C12 


C\2 


4 


C22 = P 5 + PC22 


C22 


15 


U 5 = U 2 + C12 


C12 


5 


C\2 = P5 + PC12 


C12 


16 


P 4 = aAi 2 T 4 - PC21 


C21 


6 


S2 = S\ — An 


Xi 


17 


S 3 = A n - A 2 i 


Xi 


7 


T2 = B22 — Ti 


x 2 


18 


T3 = B22 — B\2 


x 2 


8 


Pi = aAuBu 


x 3 


19 


U 3 - aS 3 T 3 + U 2 


x 3 


9 


C u =Pi+ 0Cu 


Cn 


20 


U7 = U3 + C22 


C22 


10 


U 2 = aS 2 T 2 + Pi 


x 3 


21 


Uq = U3 — C21 


C21 


11 


Ui = aAi 2 B 2 i + Cu 











Table 2: Schedule for operation C <— a A x B + (3C with 3 temporaries 



3.2.3 Control of the overflow 

Since Winograd's algorithms will be used with delayed modular reductions, one 
has to ensure that any intermediate computation will fit in the underlying fixed- 
size integer representation being used. Indeed, intermediate values can become 
large in this algorithm, and the former bound for the dot-product no-longer 
holds. 

The main result of this section is that, in the worst case, the largest in- 
termediate computation occurs during the recursive computation of the sixth 
recursive product Pq (see appendix A). This result generalizes [17, theorem 3.1] 
for the computation of AB + (iC. 

Theorem 3.1. Let A e Z mxk , B S Z fex ™ C E Z mxn be three matrices and 
f3 G Z with vriA < cii,j < Ma, wb < bij < Mb and mc < Cjj < Mc- 
Moreover, suppose that < — vtia < Ma, < -raj < Mb, < — mc < Mc, 
Mc < Mb and \(3\ < Ma, Mb- Then every intermediate value z involved in the 
computation of Ax B + [3C with I (I > 1) recursive levels of Winograd algorithm 
satisfy: 



YA < ( — g— M A + ~^— m A ) I — 7; — Mb + ~^^m B 

Moreover, this bound is optimal. 





k 


) 


J 1 . 
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The proof is given in appendix A. 

Using a positive integer representation of the prime field elements (integers 
between and p — 1), the following corollary holds: 

Corollary 3.2 (Positive modular representation). Using the same notations, 
with dij ,bij,Cij, (3 e [0 ... p — 1] , we have 



z < 



m 3 


k 







(v-i? 



Instead, using a balanced representation (integers between — ^— ^ and ^rp), 
this bound can be improved: 



Corollary 3.3 (Balanced modular representation). Using the same notations 
j— l p— ] 

2 ' • ' 2 

'3 r 



with Ojj, bij,dj, 13 & [— . . . ^rp], we have 



z < 



(P-1) 2 



Corollary 3.4. One can compute I recursive levels of Winograd algorithm with- 
out modular reduction over integers of 7 bits as long as k < kwinograd where 



k Winograd I 2 ^ I ^ 



27 +2 

k ((l + 3%T-l)) 
/or a positive modular representation and 

( 27+2 \ 1 

kwinograd — — — —2 +1)2 

\(3'(P-1)) / 
for a balanced modular representation. 



3.3 Timings and comparison with numerical routines 

This section presents experiments of our implementation of the matrix multi- 
plication kernel described above. 

The experiments use two different BLAS library: the automatically tuned 
BLAS ATLAS [46], and the BLAS by Kazushigc Goto [28] refered to as GOTO. 
We used the gec compiler version 4.1 on the Xcon machine and the ice compiler 
version 9.0 on the Itanium. We recall that dgemm refers to the BLAS matrix 
multiplication routine over double precision floating point numbers. Similarly, 
we named our routine over a word-size finite field f gemm. 

The tables 3 and 4 report timings obtained for both exact and numeric ma- 
trix multiplication. First the comparison shows that the exact computation over 
a word size finite field (modulo 65521 on these tables) can reach a similar range 
of efficiency as the numerical computation. For increasing matrix dimensions, 
the exact computation becomes even more efficient (see also figure 1), thanks 
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n 


1000 


2000 


3000 


5000 


7000 


8000 


9000 


10000 



ATLAS 


fgcmm 


0.38s 


2.73s 


8.59s 


36.34s 


95.21 


134.03s 


190.21s 


258.08 


dgemm 


0.37s 


2.98s 


10.02s 


46.10s 


126.38s 


188.97s 


267.83s 


368.30s 


/ gemm 
dqemm 


1.02 


0.92 


0.86 


0.79 


0.75 


0.71 


0.71 


0.70 



GOTO 


fgcmm 


0.36s 


2.53s 


7.95s 


33.44s 


87.46s 


124.86s 


177.25s 


238.00s 


dgemm 


0.34s 


2.65s 


8.90s 


41.01s 


112.31s 


167.20s 


237.16s 


325.62s 


J gemm 
dgemm 


1.05 


0.96 


0.89 


0.82 


0.78 


0.75 


0.75 


0.73 



Table 3: Comparison between f gemm and dgemm on a Xeon, 3.6GHz 





n 


1000 


2000 


3000 


5000 


7000 


8000 


9000 


10000 






ATLAS 


fgcmm 


0.46s 


3.22s 


10.14s 


42.28s 


110.64s 


163.53s 


225.08s 


296.56s 


dgemm 


0.45s 


3.49s 


11.45s 


53.12s 


144.45s 


215.53s 


305.21s 


419.00s 


J gemm 
dgemm 


1.01 


0.92 


0.89 


0.80 


0.77 


0.76 


0.74 


0.71 




GOTO 


fgemm 


0.43s 


2.99s 


9.35s 


39.21s 


104.07s 


152.12s 


209.22s 


277.32s 


dgemm 


0.40s 


3.18s 


10.61s 


48.88s 


133.75s 


200.11s 


283.94s 


390.37s 


}gemm 
dqemm 


1.06 


0.94 


0.88 


0.80 


0.78 


0.76 


0.74 


0.71 



Table 4: Comparison between fgemm and dgemm on Itanium2, 1.3GHz 



to the use of Winograd's algorithm (improvement factor between 13% and 29% 
for dimension 10 000). 

These experiments also show the advantage of relying on a generic interface 
for numerical BLAS: the exact computation will directly take advantage of the 
improvements of the best numerical routine. This appears when comparing 
GOTO and ATLAS on these two target architecture, where GOTO is about 
10% faster. 

4 Triangular system solving with matrix right /left 
hand side 

We now discuss the implementation of solvers for triangular systems with matrix 
right hand side (or equivalcntly left hand side). The resolution of such systems 
plays a central role in many linear algebra problems, e.g. it is the second main 
operation in block Gaussian elimination after matrix multiplication as will be 
recalled in section 5.1. This operation is commonly named trsm in the BLAS 
convention. In the following, we will consider without loss of generality the 
resolution of an upper triangular system with matrix right hand side, i.e. the 
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operation B <— U~ 1 B, where U is m x m upper triangular and B is m x n. 

Following the approach of the BLAS numerical routine, our implementation 
is based on a block recursive algorithm to reduce the computation to matrix 
mult iplications. 

Now similarly to our approach with matrix multiplication, the design of our 
implementation also focuses on delaying the modular reductions as much as 
possible. As will be shown in section 4.2, delaying the whole resolution leads to 
a quick growth in the size of coefficients. Therefore we also present in section 
4.3 another way of delaying these modular reductions. We lastly present how 
to combine these two techniques within a multi-cascade algorithm. 



4.1 The block recursive algorithm 

Algorithm trsm recalls the block recursive algorithm. 



Algorithm 1: trsm (A, B) 

Data: A e Z/ p Z mxm , B e Z/ p Z mxI \ 
Result: X G Z/ p Z mx ™ such that AX = B. 
begin 

if m = 1 then 
[ X := A^\ x B 

else 

/* splitting matrices into two blocks of sizes and 

\f 1 



A 




X 




B 


' A 1 A 2 ' 

A 3 _ 




x l 

x 2 




Bi 
B 2 



*/ 

X 2 :=trsm (A 3 ,B 2 ) 
B\ := B\ — A 2 X 2 
_ Xi :=trsm (A 1 ,B 1 ) 

end 



Lemma 4.1. Algorithm trsm is correct and the leading term of its arithmetic 
complexity over Z/pZ is 

i r n ~i 
rM(m,n)_ -\MM(m) 

This complexity is m 2 n using classic matrix multiplication. 
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Proof. Extending the previous notation MM (n) , we denote by MM (m,k,n) the 
cost of multiplying amxfcbyafcxn matrices. The cost function TRSM(m, n) 
satisfies the following equation: 

Til Til Til 

TRSM(m, n) = 2TRSM( — , n) + MM( — , — , n). 

Let t = log 2 (m). Although the algorithm works for any n, we restrict the 
complexity analysis to the case where m < n for the sake of simplicity. We then 
have: 

m . 1 



777 1 T 77 n 

TRSM(m,n) = 2TRSM(— ,77) + — — ■ — MM(to) 



2 ' ; 2"- 1 
1 



= 2*TRSM(l,n) + — T — MM (to) 



to I 



77 



rn 



.1- w 

2 



As TRSM(l,n) = 2n and (2" 1 )* = m u 1 , we obtain the expected complexity 

TRSM(777, Tl) = 21^2 [^1 MM(777) + 0{m 2 + 77777). □ 



4.2 Delaying reductions globally 

As for matrix multiplication, the delayed computation relics on the fact that 
ring operations over the finite field can be replaced by ring operations over Z 
using the ring homomorphisms described in section 2.1.2. However, triangular 
system resolutions involve, in the general case, field operations: the divisions by 
the diagonal elements of the triangular matrix. Therefore this technique is only 
valid with unit diagonal matrices. 

In the general case, the triangular matrix is made unit diagonal by the 
following factorization: A = DU, where D is diagonal and U is unit diagonal 
upper triangular. Then the system UX = D^ 1 B only involves ring operations 
and can be solved over Z. This normalization leads to an additional cost of 
0(77777) arithmetic operations (see [18] for more details). 

Now the integer computation with a fixed sized arithmetic (e.g. the floating 
point arithmetic) is exact as long as all intermediate results of the computation 
do not exceed the bit capacity of the representation. Therefore we now propose 
bounds on the values computed by the algorithm over Z. 

Theorem 4.2. Let T E Z" x " be a unit diagonal upper triangular matrix and 
b £ Z n , with 777 < T itj < M and 777 < b { < M and 777 < < M. Let x = 
{xi)ie[i...n] 6 Z" be the solution of the system Tx = b. Then V k G [0 ... 77 — 1] : 

( -Uk < x n -k < v k for k even, 
\ -Vk < x n -k < u k for k odd 

with 

f u k = M^in^M + l) k - M±m(M - l) fe , 
\ v k = ^pi(M + l) fe + M±22(M - l) k . 
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Proof. First note the following relations: 



u k < v k 

Vfc { —muk < Mvk 
-mvh < Muk 

The third one comes from 

M 2 — m 2 

Mu k + mv k = ((M + l) fe - (M — l) fe ) > 0. 

The proof is now an induction on k, following the system resolution order. The 
initial case k = correspond to the first step: x n = b n , leading to 

— uq = m < x n < M = vq. 

Suppose now that the inequalities hold for k e [0 . . .1} and prove them for 
k — I + 1. If Z is odd, / + 1 is even. 



Xn-l-l 



bn-l-1 — Tn-l-\,j%i 



J = n- 

l-l 



< M + max(M«2i, —mv2i) + ma,x(Mv2i+i, —mu2i+i) 

i=0 

/ ^ 

9. 



< M 



1 + ^2 U2i + V2i+1 



V 



< m | l + J2^-^(M + 2)(M + l) M + ^-+^ {M ~2)(M-V 2 ' 

i=0 

^ M-m,^ n JM + -1 M + m /llir „, (M - - 1 

* M ( 1 + ^ (M + 2) (M + i) 2 -l + ^ (M ' 2) (Af-1) 2 -1 

< ^(M + i)^ + ^±^(Af-l)'+ 1 = « I+1 . 

Similarly, 

2 

x„-i-i > m - y^max(M^2j, -mu 2 j) + max(MM 2 j+i, -m^+i) 

i-i 

2 

> m — M ^ f 2 j + M2i+1 



i=0 
1-1 



> m _M^^^(M + 2)(M + l) 2 ^^±^(M-2)(M-i; 



2z 



2 

i=0 



> m - M (' jl^(Af + 2) <%± - 1 - -i±--(M - 2) ( M - 



2 v y (M + 1) 2 -1 2 v ' (M - 1) 2 -1 

(M + 1)i+1 _M + m (M _ 1)i+1=Mi+ ^ 
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For I even, a similar proof leads to 

-vi+i < i„-;-i < u l+x . 

Corollary 4.3. Using the notation of theorem 4-2, 

.^M-to,., M + m 

M < 2 ^ + J ) + 

Moreover this bound is optimal. 



□ 



(m - iy 



Proof. The sequence (vk) is increasing and always greater than (ufc). Thus 
V fe G [0 ... n - 1] \x n -k\ < u k <v k < v n -i- 

Now the vector x = (a;»)ie[i...n] G 2" such that V fc G [0 . . . n— 1] |x„_t| = Vk 
satisfies the system Tx = b with 



1 M 


m 


M 


,6 = 


m 


1 


M 


m 


M 




1 


M 




m 






1 




M 



Therefore the bound is reached. 



□ 



The following corollaries apply this result to the positive and balanced mod- 
ular representations. 

Corollary 4.4 (Positive modular representation). For 1 < i, j < n, ifTij,bi G 
[0 . . .p — 1], then 

\x\<^{p n - l + {p-l) n - 1 ). 

Corollary 4.5 (Balanced modular representation). For 1 < i,j < n, ifTi j,bi G 
[_£_i... 2^], then 

£ < — — 



Remark 4.6. T/ie balanced modular representation improves the bound by a 
factor 0/2™- 1 . 

As a consequence, one can solve a unit diagonal triangular system of dimen- 
sion n using arithmetic operations with integers stored on 7 bits if 



£-i(p»-i + (p-i)«-i)<2T 



for a positive representation and 

p — 1 / f> + 1 



< 2 7 



(4) 



(5) 
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for a balanced representation. 

For instance, using the double floating point representation (53 bits of man- 
tissa) the maximal dimension of the system is 34 (resp. 52) for a positive (resp. 
balanced) representation of Z3. For larger fields, this maximal dimension be- 
comes quickly very small: with p = 1001, n < 5 (resp. n < 6) for a positive 
(resp. balanced) representation. 

In the following, we will denote by £del(P) 7) the maximum dimension for the 
resolution with delayed modular reductions. This dimension is small, and this 
approach can therefore only be used as a terminal case of the recursive block 
algorithm. This first cascade algorithm is characterized by the threshold idol- 
For efficiency, we used in our implementation the BLAS routine trsm to perform 
the delayed computation over Z. Despite the small dimension of the blocks, we 
will see in section 4.4 that this approach can slightly improve the efficiency of 
the computation when the finite field is small. 

4.3 Delaying reductions in the update phase only 

The block recursive algorithm consists in several matrix multiplications of differ- 
ent dimensions. In most cases, the matrix multiplications are done over Z with 
a modular reduction on the result only. But part of these result matrices will be 
accumulated to other matrix multiplications in later computations. Therefore 
these intermediate modular reductions could be delayed even more by allowing 
to accumulate these results over Z as much as possible. 

This technique can be applied within the former cascade algorithm, to pro- 
duce a double cascade structure. The key idea is to split the matrices at two lev- 
els as shown on figure 2: a fine grain splitting with the dimension £<jel of the pre- 




Figure 2: Splitting for the double cascade trsm algorithm 



vious section, and a coarse grain splitting with the dimension i up d a te such that all 
recursive calls of dimension lower than t up date can let the matrix multiplication 
updates accumulate without modular reductions. Choosing i up date = ^winograd 
(from corrolary 3.4) will ensure this property. To adjust together the dimensions 

of the two block decompositions, we Set i S plit = L^WinogradAdelJ *del- 

Algorithm 2 is a loop on every block of column dimension £ U pdatc- For each 
of them, the triangular system is solved using algorithm 3 and the update is 
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Algorithm 2: trsm-rec-BLAS-delayed 



Data: A e Z/ p Z mxm , B G Z/ p Z mx " 
Result: X G Z/ p Z mx ™ s.t. AX = B 
begin 

Compute idci from equation (4 or 5) 
Compute iwinograd from corrolary (3.4) 

inograd AdclJ ^dcl 

foreach block column of A of dimension m x t sp u t of the form Ui 



do 

Xi = trsm-partial-delayed([/i, Bi) 

Xi = Xi mod p 
B\...i-\ — — ViXi 

= mod p 

return X 
end 



Algorithm 3: trsm-partial-delayed 



Data: A G Z/ p Z mxm , 5 G Z/ p Z mx ", m must be lower than t upd ato 

Result: X G Z/pZ mx ™ s.t. AX = S 

begin 

if m < ndei then 
B = B mod p 

X = dtrsm(.4, B) ; /* the BLAS routine */ 

X = X mod p 
else 

/* (splitting of the matrix into blocks of dimension 

LtJ and \fV */ 

AX B 



'A! A 2 ' 


\ Xl ' 




Bi 


A 3 


x 2 




B 2 



X 2 := trsm-partial-delayed(A3, B 2 ) 

B\ := B\ — A 2 X 2 ; /* without modular reduction */ 

X\ := trsm-partial-delayed(^4i, B\) 
return X 



end 
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performed by a matrix multiplication over Z followed by a modular reduction. 
Algorithm 3 is simply the cascade algorithm of the previous section: the block 
recursive algorithm 1 with the fully delayed algorithm as a terminal case. The 
matrix multiplication updates are performed over Z without any reduction of 
the result, since the threshold i U pdate allows to accumulate them. 

4.4 Experiments 

We now compare three implementations of the trsm routine over a word size 
finite field: 

Pure recursive (Pure-Rec): Simply algorithm 1, 

Recursive-BLAS (Rec-BLAS): The cascade algorithm formed by the recursive 
algorithm and the BLAS routine dtrsm as a terminal case. It differs from 
algorithm 3 by the fact that the matrix multiplication B\ := B\ — A2X2 
is always followed by a modular reduction. 

Recursive-BLAS-Delayed (Rec-BLAS-Delayed): algorihtm 2. 

We compare these three variants over finite fields with different cardinalities, 
so as to make the parameters idei and i up dato vary as in the following table: 



p 


\\og 2 p] 


tde\ 


^update 


5 


3 


23 


2147483 642 


1048 583 


20 


2 


8190 


8 388 617 


23 


2 


126 



In the experiments of figure 3, the matrix B is square (m = n). One can 
first notice the gain provided by the use of the first cascade with the delayed 
dtrsm routine by comparing the curves rec-BLAS and pure-rec for p = 5. 
This advantage shrinks when the characteristic gets larger, since td e \ = 2 for 
p= 1048 583 or p = 8 388 61. 

Now the introduction of the coarse grain splitting, delaying the reductions 
in the update phase improves by up to 500 Mfops the computation speed. This 
gain is similar for p = 5 and p = 1 048 583 since in both cases n < i U pdatc and 
there is therefore no modular reduction between the matrix multiplications. 

Lastly for p — 8 388 617, the speed drops down since more reductions are re- 
quired. The variants pure-rec and rec-BLAS are penalized by their dichotomic 
splitting, creating too many modular reductions after each matrix multiplica- 
tion. Now rec-BLAS-delayed has the best efficiency since the double cascade 
structure minimizes the number of reductions. 

We now give a comparison of this implementation with the equivalent routine 
of the original BLAS dtrsm. As for matrix multiplication in section 3.3, we com- 
pare the routines according to two different BLAS implementations (i.e. ATLAS 
and GOTO) and two different architectures. Nevertheless, we do not present 
the results with ATLAS on Xeon architecture due to the surprisingly poor ef- 
ficiency of ATLAS dtrsm during our tests. In the following, ftrsm denotes 
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TRSM over Z/5Z on a P4-3.2Ghz 




500 1000 1500 2000 2500 3000 3500 4000 4500 5000 



TRSM over Z/1 048583Z on a P4-3.2Ghz 




500 1000 1500 2000 2500 3000 3500 4000 4500 5000 



TRSM over Z/838861 7Z on a P4-3.2Ghz 



1000 - 







"7 






Pure-rec + 




Rec-BLAS 




Rec-BLAS-Delayed — « — 



500 1000 1500 2000 2500 3000 3500 4000 4500 5000 



Figure 3: Comparison of the trsm variants for p = 5, 1 048 583, 8 388 617, on a 
Pentium4-3,2Ghz-lGo 
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the trsm routine over 16-bits prime field (i.e. Z 6 5 52 i) using the ZpZ-double 
implementation. 





n 


1000 


2000 


3000 


5000 


7000 


8000 


9000 


10000 






















ATLAS 


ftrsm 


0.37s 


1.93s 


5.73s 


23.63s 


62.50s 


91.67s 


121.84s 


166.74s 



GOTO 


ftrsm 


0.25s 


1.66s 


5.08s 


21.47s 


55.95s 


80.77s 


111.57s 


150.81s 


dtrsm 


0.17s 


1.35s 


4.50s 


20.64s 


56.19s 


83.85s 


119.18s 


163.33s 


J trsm 
dtrsm. 


1.47 


1.23 


1.13 


1.04 


1.00 


0.96 


0.94 


0.92 



Table 5: Timings of triangular solver with matrix hand side on a Xeon, 3.6GHz 



n 


1000 


2000 


3000 


5000 


7000 


8000 


9000 


10000 



ATLAS 


ftrsm 


0.34s 


2.28s 


7.11s 


30.26s 


77.43s 


112.01s 


158.00s 


214.31s 


dtrsm 


0.26s 


1.95s 


6.37s 


28.60s 


76.44s 


113.78s 


161.19s 


219.31s 


Jtrsm 
dtrsm 


1.31 


1.17 


1.12 


1.06 


1.01 


0.98 


0.98 


0.98 




GOTO 


ftrsm 


0.30s 


2.00s 


6.23s 


26.67s 


68.22s 


104.32s 


137.96s 


192.37s 


dtrsm 


0.21s 


1.61s 


5.36s 


24.59s 


67.35s 


100.42s 


142.43s 


195.79s 


ftrsm 
dtrsm. 


1.43 


1.24 


1.16 


1.08 


1.01 


1.04 


0.97 


0.98 



Table 6: Timings of triangular solver with matrix hand side on Itanium2, 
1.3GHz 



Tables 5 and 6 show that our implementation of exact trsm solving is not far 
from numerical performances. Moreover, on our Xeon architecture, with GOTO 
BLAS, we are able to achieve even better performances than numerical solving 
for matrices of dimension greater than 7 000. 

The good performance of our implementation is mostly achieved with the 
efficient reduction to fast matrix multiplication and the double cascade struc- 
ture. Figure 4 shows the ratio of the computation time of our trsm compared 
with matrix multiplication routine. According to lemma 4.1, this ratio is 1/2 
with u = 3 and 2/3 with u> = log 2 7. In practice, our implementation only 
performs a few recursive calls of Winograd's algorithm, and the ratio appears 
to be between 0.5 and 0.666 as soon as the dimension is large enough, showing 
the good efficiency of the reduction to matrix multiplication. 

5 Finite Field Matrix Factorizations 

We now come to one of the major interest of linear algebra over finite field: 
matrix multiplication based algorithms. The classical block Gaussian elimina- 
tion is one of the most common algorithm to achieve a reduction to matrix 
multiplication [45]. Nevertheless, our main concern here is the singularity of 
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ratio of triangular system solving with matrix right hand side / matrix multiplication 



2 0.65 







FFLAS/FFPACK (GOTO) 

Theoretical ratio for Full Recursive Winograd 

BLAS/LAPACK (GOTO) — b— 
Theoretical ratio for Classical Multiplication - 

















1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 



Matrix dimension 



Figure 4: Comparing triangular system solving with matrix multiplication on a 
Xeon, 3.6GHz 



the matrices since we want to derive efficient algorithms for most problems (e.g. 
rank or nullspace) . One approach there is then to use a triangular form of the 
input matrix. Hence, matrix triangularization algorithm plays a central role for 
this approach. In this section we focus on practical implementations of triangu- 
larization in order to efficiently deal with rank profile, unbalanced dimensions, 
memory management, recursive thresholds, etc. In particular we demonstrate 
the efficiency of matrix multiplication reduction in practice for many linear al- 
gebra problems. 

5.1 Triangularizations 

The classical block LDU or LUP factorizations (see [1]) can not be used due to 
their restriction to non-singular case. Instead one would rather use the LQUP 
factorization of [33]. We here propose a fully in- place variant and analyze its 
behaviour. 

The LQUP factorization is a generalization of the well known block LUP 
factorization for the singular case [5]. Let A be a m x n matrix, we want to 
compute the quadruple < L,Q,U,P > such that A = LQUP. The matrix L 
is lower triangular, P and Q are permutation matrices and U is a rank r upper 
triangular matrix with its r first rows non-zero. 

The algorithm with best known complexity computing this factorization uses 
a divide and conquer approach and reduces to matrix multiplication [33]. Let 
us describe briefly the behavior of this algorithm. 

The algorithm is recursive: first, it splits A in halves and performs a recursive 
call on the top half. After some row permutations, It thus gives the T, Y and 
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L\ blocks of figure 5, together with some row permutations stored in Q. Then, 
after some column permutations {[XZ] = [A 2 iA 2 2]P), the algorithm computes 
G such that GT = X via trsm, replaces X by zeroes and eventually updates 
Z = Z — GY. The third step is a recursive call on Z, followed by an update of 
Q. We let the readers refer e.g. to [3, (2.7c)] for further details. 

Furthermore, our implementation of LQUP also uses the trick proposed in 
[18, §4.2], namely storing L in its compressed form L. 

This triangularization is thus fully in-place. 



1 



Figure 5: Principle of the LQUP factorization 



Lemma 5.1. The dominant term of the time complexity of algorithm LQUP 
with m < n is 





" n 


1 1 


( 


m 


2^-i _ 2 2" - 



LQUP(m,n) = 



The latter is nm 2 — ^m 3 with classical multiplication. 



MM(m). 



Proof. Lemma 2.2 ensures that the cost is 0(m u + nm^ x ). We thus just 
have to look for the constant factors. Then we write LQUP(m,n) = am" + 
finm"- 1 = LQUP(m/2,n) + TRSM(m/2,r) + i?(m/2,r,n-r)+LQUP(m/2,7i~ 
r), where r is the rank of the first m/2 rows. This gives am" + j3nm"^ 1 = 



m(n— r) 
2p 



MM(r) +a(m/2) u 



a{m/2Y +[3n{m/2) u - 1 + 2 ^-\„ 2 [§] MM(r) 

(3(n — r)(m/2) UJ ~ 1 . With m < n, the latter is maximal for r = m/2, and 
then, writing MM(i) = C^x", we identify the coefficient on both sides: (3 — 



2"- 1 1 2"- 1 1 2- 

thc announced terms 



and a 



9 a 
Z 2" 



2"(2"-4) 



. Solving for f3 and a gives 
□ 



5.2 Performance and comparison with numerical routines 

Fast matrix multiplication routine of section 3.2 allowed us to speed up ma- 
trix multiplication as well as triangular system solving. These improvements 
are of great interest since they directly improve efficiency of triangularization. 
We now compare our exact triangularization over finite field with numerical tri- 
angularization provided within LAPACK library [2]. In particular, we use an 
optimized version of this library provided by ATLAS software in which we use 
two different BLAS kernel: ATLAS and GOTO. 
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n 


1000 


2000 


3000 


5000 


7000 


8000 


9000 


10000 



ATLAS 


Iqup 


0.32s 


1.84s 


4.89s 


19.34s 


48.94s 


73.86s 


97.50s 


131.11s 


dgetrf 


0.17s 


1.19s 


3.83s 


16.90s 


45.32s 


67.44s 


94.83s 


130.15s 


iqup 
dqetr f 


1.88 


1.55 


1.28 


1.14 


1.08 


1.10 


1.03 


1.01 



GOTO 


Iqup 


0.25s 


1.52s 


4.47s 


17.93s 


44.54s 


67.88s 


89.63s 


119.65s 


dgetrf 


0.15s 


1.03s 


3.33s 


14.84s 


39.58s 


58.61s 


82.89s 


113.47s 


iqup 
dqetr f 


1.67 


1.48 


1.34 


1.21 


1.13 


1.16 


1.08 


1.05 



Table 7: Performance of matrix triangularization (for Z/65521Z and floats) on 
a Xeon, 3.6GHz 





n 


1000 


2000 


3000 


5000 


7000 


8000 


9000 


10000 






ATLAS 


Iqup 


0.38s 


2.20s 


6.36s 


25.22s 


61.64s 


89.74s 


127.43s 


163.68s 


dgetrf 


0.20s 


1.47s 


4.61s 


20.26s 


53.57s 


79.37s 


111.66s 


152.42s 


iqup 
dqetr f 


1.85 


1.50 


1.38 


1.25 


1.15 


1.13 


1.14 


1.07 




GOTO 


Iqup 


0.34s 


2.00s 


5.81s 


23.11s 


56.80s 


83.90s 


113.66s 


150.82s 


dgetrf 


0.16s 


1.17s 


3.80s 


17.07s 


46.18s 


69.00s 


97.56s 


134.01s 


Iqup 
dgetrf 


2.21 


1.72 


1.53 


1.35 


1.23 


1.22 


1.16 


1.13 



Table 8: Performance of matrix triangularization (for Z/65521Z and floats) on 
Itanium2- 1 . 3GHz 



Tables 7 and 8 show efficiency obtained with our exact triangularization 
based on fast matrix multiplication and the one obtained with numerical compu- 
tation. There, "dgetrf" computes a floating point LU factorization of a general 
m x n matrix using partial pivoting with row interchanges. Exact computation 
is done in the prime field of integers modulo 65521. We arc now mostly able 
to reach the speed of numerical computations. More precisely, we are able to 
compute the triangularization of a 10 000 x 10 000 matrix over a finite field in 
about 2 minutes on a Xeon 3.6GHz architecture. This is only 5% slower than 
the best numerical computation. 

We could have expected that our speed would have been even better than 
numerical approach since we take advantage of Strassen-Winograd's multipli- 
cation while numerical computations arc not. However, in practice we do not 
fully benefit from fast matrix multiplication since we work at most with matri- 
ces of half dimension of the input matrix due to the recursive structure of the 
algorithm. Then, the number of Winograd calls is at least one less than within 
matrix multiplication routines. In our tests, it appears that we only use 3 calls 
on our Xeon architecture and 1 call on the Itanium2 architecture according to 
matrix multiplication threshold. This explains the better performance on the 
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Xeon compared to numerical routines than the Itanium2 architecture. 

Note also that in order to take even more into account data locality one can 
develop a version of LQUP where blocks are maintained as square as possible. 
Indeed, as soon as the RAM is full, data locality becomes more important 
than memory saves. The TURBO method [22] addresses this issue. A first 
implementation of TURBO has been studied in [18, §4.5] and it reveals to be 
the fastest for large matrices, despite its bigger memory demand [18, Figure 6]. 
This is advocating further uses of recursive blocked data formats and of more 
recursive levels of TURBO. 



5.3 Comparison with the multiplication 

The LQUP factorization and the trsm routines reduce to matrix multiplication 
as we have seen in the previous sections. Theoretically, as classic matrix mul- 
tiplication requires 2n 3 — n 2 arithmetic operations, the factorization, requiring 
at most |n 3 arithmetic operations, could be computed in about ^ of the time. 
However, when Winograd fast matrix multiplication algorithm is used this ratio 
becomes |. Figure 6 shows that the experimental behavior of the factorization 
is not very far from this theoretical ratio. 




Figure 6: Comparing matrix triangularization with matrix multiplication on a 
Xeon, 3.6GHz 



6 Applications 

In this section, we use our matrix multiplication, matrix factorization and matrix 
solvers as basic routines to perform other linear algebra routines. For instance, 
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from the two routines (i.e. LQUP and trsm), one can also directly derive several 
other algorithms, e.g.: 

• The rank is the number of non-zero rows in U. 

• The determinant is the product of the diagonal elements of U (stopping 
whenever a zero is encountered). 

In the following, we first give the theoretical complexities with explicit con- 
stant terms. These constants depend on the kind of matrix multiplication used 
(fast or classical). In order to validate our approach we then compare this 
theoretical ratios to some experimental ones. 

6.1 Nullspace basis 

Computing a right nullspace basis with the LQUP factorization is immediate 
on a m x n full rank matrix, where m < n: if U = [U1U2], the matrix U-f" U 2 
completed with identity matrix yields a basis for the nullspace of A. 

This requires NS(m; n) = LQUP(m; n) + TRSM(m; n — m). which gives 

_L_)MM(m) (6) 

The latter is (m 2 n — |m 3 ) + (n — m)m 2 = 2m 2 n — |m 3 with classical multi- 
plication. One can notice that computing a right nullspace of the transposed of 
the input matrix yields a left nullspace basis. 

6.2 Triangular multiplications 

6.2.1 Triangular matrix multiplication 

To perform the multiplication of a triangular matrix by a dense matrix via a 
block decomposition in halves, one requires four recursive calls and two dense 
matrix-matrix multiplications. The cost is thus TRMM(n) = 4TRMM(n/2) + 
2MM(n/2), solving for TRMM(n) = aMM(n) yields 

TRMM(n) = — 1— MM(n). (7) 

The latter is n 3 with classical multiplication. 

6.2.2 Upper-lower Triangular matrix multiplication 

The block multiplication of a lower triangular matrix by an upper triangular 
matrix is 

A 1 B 1 +A 2 B 3 A 2 B 4 ' 
A4B3 A4B4 



NS{m;n) = ( 



m, 



Ax A 2 




Si 




Ai 


X 


B 3 £? 4 
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The cost is thus UTLT(n) = 2UTLT(n/2) + 2Ti?MM(n/2)+MM(n/2), solving 
for UTLT(n) = aMM(n) yields 



UTLT(n) = 



2" 



(2 W -4)(2 W -2) 



MM(n). 



(8) 



The latter is |n 3 with classical multiplication. 



6.2.3 Upper-Upper Triangular matrix multiplication 

Now the block version is even simpler (of course the lower lower multiplication 
is similar): 

AiSi A 1 B 2 + A 2 B 4 
A 4 B 4 

The cost is thus UTUT(n) = 2UTUT(n/2) + 2TRMM(n/2), which yields 

4 



' M A 2 ~ 


X 


E>i B 2 




A 4 







UTUT(n) 



(2 W -4)(2<" - 2) 



MM(n). 



(9) 



The latter is in 3 with classical multiplication. 



6.3 Squaring 
6.3.1 A x A T 

Suppose we want to compute A times its transpose, even with a diagonal in the 
middle. The block version is 



A 1 A 2 
A 3 A 4 



D 1 



D 4 



AT Al 

An Aj 



AiDiA^ + A2D4A2 A 1 D 1 Aj + A 2 D 4 A 
A 3 D 1 A[ + A A D A Al A z DiAl + A A D 4 A 



Since ADA T is symmetric, the lower left and upper right are just transpose 
of one another. The other corners (upper left and lower right) are computed via 
recursive calls. Thus the arithmetic cost of this special product is AAT(n) = 
4AAT(n/2) + 2MM(n/2) + 3ADD(n/2) + 2(n/2) 2 

Ignoring the cost of the three additions and the diagonal multiplications, 
this yields 

AAT ^ = 2^T^ MM ( n )- ( 10 ) 

The latter is n 3 with classical multiplication. One can note that when A is 
rectangular with m < n the cost extends to 



AAT(m; n) = 



-MM(to). 



(11) 
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6.3.2 Symmetric case 

When A is already symmetric, and if the diagonal is unitary, the constant factor 
decreases. Indeed, in this case A 2 = A^ and then one of the four recursive calls 
is saved. Also one of the remaining three recursive calls is a call to a non 
symmetric AA T . Therefore the cost is now: SymAAT(n) — 2SymAAT(n/2) + 
AAT{n/2) + 2MM(n/2), once again ignoring n 2 . This yields 

SymAAT(n) = ^^yMM(n). (12) 

The latter is |n 3 with classical multiplication. 



6.3.3 Triangular case 

We here view the explicit computation of L T DL for instance as a special case 
of upper-lower triangular matrix multiplication, but where both matrices are 
symmetric of one another. We also show that we can add an extra diagonal 
factor in the middle at a negligible cost. Consider then 

L 3 D X L\ L-sD x Ll + L 4 D 4 LJ 

Thus it requires two recursive calls, a call to AAT (with a diagonal in the 
middle) only one call to TRMM as both lower-left and upper-right corners are 
transpose of one another. This yields 

LTL(ra) = — -i- — MM(n). (13) 

(2" - 4) (2 W - 2) y ' J 

The latter is |n 3 with classical multiplication. 



Li 



D A 



tT t1 



6.4 Symmetric factorization 



For the sake of simplicity, we here consider the LU factorization of a generic 
rank profile symmetric n x n matrix A. We could describe how to perform this 
decomposition with the permutation and the possible rank deficiency in the 
blocks, but we here only analyze the cost of such a LDL T factorization. The 



idea is that one can recursively decompose A = 



Ax 



A 2 
A± 



Li 
G 



' Dx 


X 


' L\ G T 


D 2 


T T 


and Dx ; a TRSM to comput 



Well, this requires a recursive call to compute Lx 



GDxG and a recursive call to compute L 2 D 2 L 2 = A 4 



; an AAT to compute 
GDxG T . The cost is 



thus LDLTin) = 2LDLT(n/2) + TRSM(n/2) + AAT(n/2), which yields 

4 



LDLT(n) 



The latter is |n 



(2«> - 4) (2" - 2) 
1,3 with classical multiplication. 



MM(ra). 



(14) 
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6.5 Matrix inverse 

6.5.1 Triangular matrix inverse 

To invert a triangular matrix via a block decomposition, one requires two re- 
cursive calls and two triangular matrix multiplications. 

A- 1 -A^A 2 A? 
The cost is thus INVT(n) = 2INVT(n/2) + 2TRMM(n/2) which yields 



A\ A 2 ~ 


-l 


A 4 





INVT(n) = 



-TRMM(n) 



(2 W -4)(2 W - 2) 



MM(n). (15) 



The latter is \n 3 with classical multiplication. 



6.5.2 Matrix inverse 

To invert a dense matrix, one needs to compute an LQUP decomposition, then 
to invert L and permute it with A TRSM is then required to solve UX = 

Q~ X L~ X . Applying P^ 1 to X yields the inverse. The cost is then INV(n) = 
LQUP(n) + INVT(n) + TRSM(n). This gives 



INV(n) = 



3x2" 



-MM(n). 



(2"-4)(2 w -2) 
The latter is INV(n) = 2n 3 with classical multiplication. 



(16) 



6.5.3 Symmetric inverse 

If A is symmetric, one can decompose it into a LDL T factorization instead of 
the LU. Therefore, its inverse is then only one INVT for both L _1 and L~ T 
followed by an LTL. The cost is then SymlNV(n) = LDLT(n) + INVT(n) + 
LTL(n) which yields 

SymINV(n) = ^ _ ^ _ 4) MM(n). (17) 

The latter is SymINV(n) — n 3 with classical multiplication. 



6.5.4 Full-rank Moore-Penrose pseudo-inverse 

A is a rectangular full rank mxn matrix. We suppose, without loss of genericity, 
that m < n. The Moore-Penrose inverse of A is thus A' — A T (AA T )~ 1 , see 
e.g. [42] and references therein. Computing the Moore-Penrose inverse is then 
just a LDL T decomposition of the symmetric matrix AA T , followed by two 
rectangular system solvings: 

MPINV(m; n) = AAT(m; n) + LDLT(m) + 2TRSM(m; n). 
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The cost is then 

MPINV(m; n) = ( \ -] + — ) MM(m) (18) 

y 1 V ml 2^-4 (2 W - 2)(2<" - 4)/ x ' v ; 

The latter is 3m 2 n + |m 3 with classical multiplication. This correspond e.g. to 
the normal equations numerical resolution [27, algorithm 5.3.1]. 



6.5.5 Rank deficient Moore-Penrose pseudo-inverse 



In this case, one needs to compute a full-rank decomposition of A. This is done 
by performing the LQUP decomposition of A and if A is of rank r, selecting 



the first r columns of L (call them L r 



G 



) and the first r rows U (call 



them U r = [I7i|y]), forgetting the permutation P. We have A — L r U r and we 
modify the formula [39, (7)] as follows: 



A* 



Y T ur T 



((L 1 +L^ T G T G)(U 1 +YY T U 1 - 1 )y 1 [I\L^ T G T \. (19) 



We note W = (Li + L^ T G T G){U 1 + YY T U^ 1 ). We compute W by two squar- 
ings, two TRSM and a classical matrix multiplication. We perform a reversed 
LU decomposition on W to get W = U W L W . Now we compute L\\J W and 
L w Ui by upper- upper triangular multiplication and H — (L'[U W )~ 1 G T and 

Z = Y T {L w Ul)- 1 by two TRSM. Now, A^ = L '^ 1 ? . W' 1 is two 

triangular inverses and an upper lower product. ZH is a rectangular multipli- 
cation and the last two blocks are obtained by two triangular solvings. 



MPINV r (m;n) = LQUP(m; n)+AAT(r; m—r)+AAT(r; n-r)+3TRSM(r, m—r) 
+ 3TRSM(r, n - r) + MM(r) + LQUP(r) + 2UTUT(r) + 2INVT(r) + UTLT(r) 

+ R(n-r;r;m-r) (20) 

The latter is 2rmn+2r 2 m+2r 2 n+m 2 n— |m 3 — |r 3 with classical multiplication. 
To get an idea, numerical computations based on the Cholesky factorization of 
AA T presented in [8] as faster than SVD or QR or iterative methods would 
require 3m 2 n + 2r 2 m + 3r 3 flops. 



6.5.6 Performances and comparisons with numerical routines 

As for triangular system solving and matrix triangularization, we now com- 
pare performances of matrix inversion for triangular and dense matrices with 
numerical computation and with matrix multiplication. Our comparison with 
numerical computation is still based on LAPACK library with two different 
BLAS kernel (i.e. ATLAS and GOTO). We do not present the result of trian- 
gular matrix inversion over our Xeon architecture according to the bad behavior 
of "dtrsm" function which is the main routine used by LAPACK for triangular 
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matrix inversion. Our base field is the prime field of integers modulo 65521 using 
a Zpz-double representation and we use fast matrix multiplication of section 
3.2. 





n 


1000 


2000 


3000 


5000 


7000 


8000 


9000 


10000 






















ATLAS 


tri. inv 


0.11s 


0.70s 


2.17s 


9.21s 


24.21s 


35.53s 


49.95s 


68.26s 



GOTO 


tri. inv 


0.10s 


0.62s 


1.90s 


8.00s 


20.97s 


30.77s 


43.38s 


58.98 


dtrtri 


0.18s 


1.04s 


2.90s 


10.97s 


26.85s 


38.57s 


52.93s 


70.95s 


tri.mv 
dtrtri 


0.56 


0.60 


0.66 


0.73 


0.78 


0.80 


0.82 


0.83 



Table 9: Timings of triangular matrix inversion on a Xcon, 3.6GHz 



n 


1000 


2000 


3000 


5000 


7000 


8000 


9000 


10000 



ATLAS 


tri. inv 


0.19s 


1.03s 


3.02s 


11.91s 


31.71s 


44.43s 


61.37s 


82.55s 


dtrtri 


0.08s 


0.58s 


2.55s 


11.39s 


30.50s 


44.52s 


63.34s 


85.19s 


tri. inv 
dtrtri 


2.25 


1.77 


1.18 


1.05 


1.04 


1.00 


0.97 


0.97 




GOTO 


tri. inv 


0.15s 


0.85s 


2.47s 


10.10s 


26.10s 


38.29s 


53.65s 


72.74s 


dtrtri 


0.08s 


0.61s 


1.96s 


8.77s 


23.68s 


35.73s 


49.84s 


69.10s 


tri. inv 
d.trtri 


1.90 


1.40 


1.26 


1.15 


1.10 


1.07 


1.08 


1.05 



Table 10: Timings of triangular matrix inversion on Itanium2, 1.3GHz 



Tables 9 and 10 illustrate the performances of our exact triangular matrix 
inversion regarding performances of LAPACK routine "dtrtri" . Results show 
that our exact computations tend to catch up with the numerical ones and 
even outperform them on Itanium2 with ATLAS for large matrices (dimension 
greater than 8000). 

One can notice that the implementation of triangular matrix inversion pro- 
vided by GOTO is quite efficient compare to ATLAS, and thus lead our exact 
computation to be more efficient but not better than numerical ones. Here 
again, this demonstrates that exact triangular matrix inversion over finite field 
is not much more costly than its numerical counterpart. 

Now, Tables 11 and 12 provide the same comparisons for dense matrix in- 
version. For numerical computation references we use the routine "dgetri" in 
combination with the factorization routine "dgetrf" to yield matrix inverse. On 
both architecture with ATLAS BLAS kernel, exact computations become the 
most efficient when matrix dimension is getting larger. Numerical computation 
is only better than exact on the Itanium 2 architecture with GOTO BLAS ker- 
nel. In this particular application, the benefit of fast matrix multiplication is 
important since it allows to outperform numerical performances. 

As shown in previous section, matrix inversion algorithms reduce to matrix 
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n 


1000 


3000 


5000 


7000 


8000 


9000 


10000 



ATLAS 


inverse 


0.75s 


13.57s 


54.52s 


141.19s 


206.26s 


285.19s 


385.35s 


dgetrf+dgetri 


0.69s 


16.94s 


80.83s 


222.07s 


368.66s 


531.29s 


761.28s 


inverse 
dqetr f -\-dqetri 


1.09 


0.80 


0.67 


0.64 


0.56 


0.54 


0.51 




GOTO 


inverse 


0.63s 


11.82s 


48.56s 


125.30s 


179.17s 


256.12s 


343.91s 


dgetrf+dgetri 


0.55s 


13.02s 


58.36s 


159.21s 


232.30s 


328.55s 


450.46s 


inverse 
dqetr f+dqetri 


1.15 


0.91 


0.83 


0.79 


0.77 


0.78 


0.76 




Table 11: Timings of matrix inversion on a Xcon, 3.6GHz 




n 


1000 


3000 


5000 


7000 


8000 


9000 


10000 






ATLAS 


inverse 


1.01s 


17.27s 


69.24s 


173.21s 


256.67s 


353.02s 


483.08s 


dgetrf+dgetri 


0.60s 


14.29s 


66.08s 


184.74s 


276.09s 


393.62s 


541.37s 


inverse 
dqetr f -\-dqetri 


1.67 


1.21 


1.05 


0.94 


0.93 


0.90 


0.89 




GOTO 


inverse 


0.85s 


14.92s 


61.00s 


153.78s 


226.68s 


313.84s 


422.78s 


dgetrf+dgetri 


0.47s 


11.45s 


51.33s 


139.00s 


207.36s 


293.02s 


402.72s 


inverse 
dqetr f+dqetri 


1.80 


1.30 


1.19 


1.11 


1.09 


1.07 


1.05 



Table 12: Timings of matrix inversion on Itanium2, 1.3GHz 



multiplication. Figures 7 and 8 show the correlation between matrix inversion 
performances and matrix multiplication performances; triangular and dense case 
are studied. 

According to section 6.5.1, the ratio of triangular matrix inversion and ma- 
trix multiplication is 4/(2" — 4)(2 W — 2); which gives a theoretical ratio of 
1/6 when classic matrix multiplication is used. However this ratio increase 
to « 0.267 when Winograd fast matrix multiplication is used (i.e. lo = log 2 7). 
Since our matrix multiplication routine is using fast matrix multiplication, the 
asymptotic behavior of this ratio should tend to the latter. However we observe 
in practice that our performances are beyond this ratio. This is due to the 
hybrid matrix multiplication which uses both Winograd and classic algorithms. 
So the practical ratio obtained here is really close to the theoretical one since it 
should asymptotically lie between 0.2674 and 0.166. 

From section 6.5.2 one can express the ratio between dense matrix inver- 
sion and matrix multiplication as respectively 1 with classic algorithm and 1.4 
with Winograd algorithm. In practice we observe that dense matrix inversion 
ratio is just above the asymptotic behavior of Winograd based inversion. This 
certainly could be explained by the number of different algorithms involved in 
this application. In particular it involves three different reductions to matrix 
multiplications; which may be of a little influence on the final performances. 
Moreover, we do not take into account memory effect which can play a crucial 
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ratio ol triangular matrix inversion / matrix multiplication 
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1000 2000 300D 4000 5000 6000 7000 8000 9000 10000 
Matrix dimension 



tatirj ol matrix inversion / matr •. nuliiplicaiior 
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Figure 7: Comparing triangular matrix 
inversion with matrix multiplication on 
a Xeon, 3.6GHz 



Figure 8: Comparing matrix inversion 
with matrix multiplication on a Xeon, 
3.6GHz 



role in performances as already demonstrated by ATLAS software with opti- 
mized BLAS [46]. In our test we used a naive approach which leads us to use 
2n 2 elements in memory. Decreasing this memory will certainly allow us to get 
better performances. In particular, it is not known yet how to perform matrix 
inversion in place using a reduction to matrix multiplication. 

7 Conclusions 

We have achieved the goal of approaching the efficiency of the numerical linear 
algebra library but for word-size prime fields. We showed that exact computa- 
tion can benefit from Winograd fast matrix multiplication algorithm and then 
even leads to outperform the efficiency of the well known BLAS and LAPACK 
libraries. 

This performance is achieved through efficient reduction to matrix multi- 
plication where we took care of minimizing the ratio and also by reusing the 
numerical computation as much as possible. We also showed that from our 
routines one can easily implement efficient algorithms for many linear algebra 
problems (e.g. null-space, generalized inverse, etc.). Note that approximate 
timings for these algorithms can be derived from the timings provided with our 
main routines. 

One can try to design block algorithms where the blocks fit in the cache 
of a specific machine to reach very good efficiency. By reusing BLAS library 
this has been proven to be almost useless for matrix multiplication in [17] and 
we think we proved here that this is not mandatory also for any dense linear 
algebra routine. Therefore, using recursive block algorithms, efficient numerical 
BLAS and fast matrix multiplication algorithms one can approach the numerical 
performance or even surpass them over some finite fields. Moreover, long range 
efficiency and portability are warranted as opposed to every day tuning. Except 
for small matrices where the conversions increase slightly the running time, and 
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except for the LQUP transform, we have shown that all our exact routines can 
be faster than their numerical counterparts. 

Besides, the exact equivalent of stability constraints for numerical compu- 
tations is coefficient growth. Therefore, whenever possible, we computed and 
improved theoretical bounds on this growth (e.g. bounds 4.5 and 3.3). Those 
optimal bounds enable further uses of the BLAS routines. 

Further developments include: 

• The main case where our wrapping of BLAS is insufficient is for very small 
matrices where benefits of BLAS are limited and fast algorithms are not useful. 
Here, a design using the finite field directly might improve the speed. 

• More generally, a Self-adapting Software [10] would allow to provide hybrid 
implementations with best empirical thresholds. 

• The technique of wrapping BLAS becomes useless when finite fields are 
larger than the corresponding bound of feasibility (e.g. p > 2 26 for matrix 
multiplication). At a non negligible price the Chinese remainder algorithm 
could be used to authorize the use of BLAS. Optimizing this scheme would then 
be an interesting way to provide similar results for larger finite fields. 

• Finally, extending the out of core versions by more recursive data format and 
the building of a parallel library is promising. Also, in the case of parallelism, our 
all-recursive approach enables a very efficient "sequential-first" parallelization 
as shown e.g. in [19] for triangular system solving. 

A Proof of theorem 3.1 

Consider the natural block decomposition 





C\2 




'All 


A12 




B n 


B\2 


C21 


C 2 2_ 




A 2 1 


A 2 2_ 




B21 


B22_ 



where An and Bn have respectively dimension to/2 x k/2 and k/2 x n/2. 

To bound the intermediate values in the computation of I recursive levels 
of Winograd's algorithm, we will show that the worst case occurs in the com- 
putation of one of the intermediaite products. We will first consider the case 
K = 2 l q and then generalize the result for every K . To end the proof we will 
provide an instance of a computation for which the bound is attained. 

A.l Some properties on the series of the type 2u — v 

Consider the series defined recursively by: 

u t+ i = 2u t - vi 
V1+1 = 2v[ - U[ 
u <0 
,«o>0 

Since 

f U l + l + V l + l = Ul + Vl = ■ ■ ■ = Uq + v 

\ v i+1 - ui+i = 3{vi - ui) = ■ ■ ■ = 3 ;+1 (i;o - u ) 
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It comes 

{(1+3') , (1-3 1 ) 
(1+3 1 ) , (1-3 1 ) 

Thus, the following properties hold: 

ui < and vi > (21) 
ui is decreasing and vi is increasing (22) 
vi > -ui if v > -it (23) 

Now define v A and v B , two series of the type v by setting Uq — rriA, Vq = 
M A , Uq = niB and Vq = Mb- 

Let us also define tj — and s j = • Thus tj +Sj = 1 and t j — Sj = 3 j . 
The following property holds: 

(2M A - m A )tj + (2m A - M A )s 3 = M A t 3+1 + m A s j+ i = vf +l (24) 

A. 2 Notations 

Let 

/1 + 3' 1-3' \ /1 + 3' , 1-3' N 

h = ——M A + ——m A ——M B + -^—m B 





K 


) 


J 1 '. 



The serie (6z)z>o is increasing since (22). 

Winograd's implementation, see e.g. [32, 21], uses the following intermediate 
computations 





Pi 




A n x B n 




P2 




A12 x B 2 i + 0C U 




P3 




(A 12 + A n - A 21 - A 22 ) x B 22 




Pa 




A 22 x (B 22 + Bn — B 2 i — B12) + P{C 22 — C\ 2 — 




P5 




{A21 + A 22 ) x (B w - B n ) + (3C 12 




P e 




(A 21 + A 22 - A n ) x (B 22 + B n - B 12 ) 




Pi 




(An - A21) x (B 22 - B12) + P(C 22 - C12) 




= Ui 




P2 + P1 




u 2 




(A 21 + A 22 - An) x (B 22 - B 12 ) + (A 21 + A 22 ) 




Us 




A 22 x (B 22 - B12) + (A21 + A 22 ) x Bn + /3(C 22 




Ui 




(A 2 i + ^22) x B 2 2 + An x (B12 - B 22 ) + /3Ci2 


Cl2 


= u 5 




C/4 + P3 


C21 


= u 6 




C/3-P4 


C22 


= u 7 




C/3 + P5 



x Bu 
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Remark that the result of the computation is independent of the algorithm 
and is always bounded by i^max( | tua |, |MA|)max(|ms|, Ms|)+/3max(|mc|, \Mc\) < 
(K + \)MaMb- Now this value is always smaller than b\ for k > 1 and also 
smaller than bi \/l > 1. Therefore, the coefficients of the blocks U\, U5, Uq and 
U7 always satisfy the bound. Now if the remaining 9 intermediate computations 
are bounded by bi, we will be done. 

We will prove that the largest intermediate value always occurs in the com- 
putation of P 6 . Consider I recursive levels indexed by j: j — I is the first 
splitting of the matrices into four blocks and j = corresponds to the last level 
where the product is done by a classic matrix multiplication algorithm. The 
recursive algorithm can be seen as a back and forth process: the splitting is 
done from j = I to j = and then the multiplications are done from j = to 
3 = I- 

We also define the following notations: 

• M m A ,M A ,m B ,M B ,m c ,M c ( x ) is an u PP er bound on thc intermediate com- 
putations of X = A xB + (3C with j recursive levels and itia < dij < Ma, 
m B < h,j < Mb and mc < Cij < Mq- k is the common dimension of A 
and B 



m A ,M A ,m B ,M B ,m c ,M, 



max x M 3 ' k 



m A ,M A ,m B ,M B ,m c ,M c 

(X). 

. M{X)^ for M'£* AtmBiMBimo , Ma {X). 

The following formulas correspond to the seven recursive calls: 



max 



( M{P 1 ) 
M(P 2 ) 
M(P 3 ) 
M(P 4 ) 
M(P 5 ) 
M(P 6 ) 
V M(P 7 ) 



= M; 
= M- 
= M. 
= M- 
= M; 
= M. 
= Mi 



M j+1 ' k = 

m A ,M A ,m B ,M B ,m c ,M c 

fc 
2 

m A ,M A ,m,B, Mb, 0,0 

k 
2 

m A .M A .m B .M B .mc,Mc 

J> 2 

2m A -2M A ,2M A -2m A ,m B ,M B ,Q,0 
j k 

J> 2 

m A ,M A ,2m B -2M B ,2M B -2m B .mc-2Mc,Mc-2m c 

J, 2 

2m A ,2M A ,m B — M B ,M B —mB,mc,Mc 
i k 

J- 2 

2m A — M A , 2M A —m A ,2m B ~ M B , 2M B -m B , 0,0 



m A —M A ,M A — m A ,m B — M B ,M B —mB,mc—Mc,Mc —mc 

Moreover, the classic algorithm is used for j = 0: 



m: 



0,k 



m A ,M A ,m B ,M B ,mc,Mc 



M A M B k + /3M C 
= max i in \ .\li;k -inu 

-MAmsk — (3mc 



(25) 



(26) 



A. 3 Some invariants 

Lemma A.l. The following invariants hold in every recursive call: 
1. < -m A < M A , < -m B < M B , < -m c < M c 
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2. m c > m B and M c < M B 

3. M c ~m c < M B - m B 

Proof. From equation (25), one gets invariants (1) and (2). Then invariant (3) 
is a consequence of (1) and (2). □ 

A. 4 Induction for K = 2 l q 

Let IHj be the following induction hypothesis: 

If the invariants of section A. 3 are satisfied then 

M j ' k = \v A ]\v B } — 

lvl m A ,MA,mB,MB,m c ,Mc 1 3 JL 3 ' 23 ' 

Suppose that the previous invariants are satisfied and that IHj is true. We 
will prove that the maximum of (25) is reached during the computation of Pq 
to show that IHj + i is satisfied. 

The conditions on tua, Ma, ms and Mb are satisfied for every recursive 
call. We can therefore apply IHj to every product X e {Pi,P 2 , P3, P4, P5, Pq} 
in order to compare M{X) with M(P 6 ). 

• For Pi = An x B n : 

M(P 6 )-M(P 1 ) = 



> 



[(2M A - m A )tj + (2m A - M A )sj\ x 
[{2Mb - m B )tj + (2m B - M B )s 3 ] - 



V 3 + l V 3 + l - V 3 



v A v B -v A v B 
V 3 + 1 V 3 + 1 V 3 V 3 



And since v A and v B are increasing and positive, we have M(Pq) > M(P\). 

• For P 2 = A 12 x B 2 i + /3C n - with the same argument M(P 6 ) > M(P 2 ). 

• For P 3 = (A 12 + A n - A 2 i - A 22 ) x B 22 : 

M(P 6 ) - M(P 3 ) = V A +1 V B +1 - v f^+Ai2-A 21 -A 22v B 22 

= v t+i v f+i - tt 2M A - 2rn A )tj + (2m A - 2M A )sj]vf 
= v t+i v f+i - ( v f+i - m ^o - M as 3 )v b {2A) 
= <i[tf + i-tf]-^«f 



i + i[v! + ,-vf]-vf +1 vf{2Z) 

> vf +1 [vf +l -2vf] 

> v A +1 3 j [M B -m B ]>0 
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• For P 4 = A 22 x (B 22 + B n - B 2 i - B 12 ) + /3(C 22 - C 12 - C 2 i): with the 
same argument, 

M(P 6 ) - M(P 4 ) - v/+i^f+i - i,^ 2 ^ B22 + Bll - Bl2 - S21 > 



• For P 5 = (A21 + A22) x (B l2 - B n ) + (3C 12 : 
M(P 6 )-M(P 5 ) = vf +1 vf +l -vf^ 



= v f v f+i- u t v f+i+ v f u f 
= vf\2vf}-ufv^ 



and since u^ 1 < < u/, vf,vf +1 it comes M(P 6 ) - M (P 5 ) > 0. 
• For P 7 = (An - A 21 ) x (P 22 - P12) + /?(C 22 - C 12 ): using P 5 , 
M(P 5 )-M(P 7 ) = v ^i+a 22v b 12 - Bii _ v A 11 -A 21y B 22 ~B 12 

= [2M A tj + 2m A s J - (M A - m A )tj - (m A - M A )sA x 

[(M B - m B )tj + (m B - M B ) Sj ] 
= [(M A + m A ){tj + 8,)] [(M B - m B )(t 3 - Sj )} 




The coefficients of the blocks U\,U$, Uq and U7 are bounded by kM A M B + 
[3Mc and are therefore smaller than the ones in P 6 . 

Lastly, we must control the size of the coefficients in U 2 = Pi + P 6 , [/ 3 = 
U 2 + P 7 and U 4 = U 2 + P 7 . 

• For U 2 = (A 2 i + A22 - A n ) x (P 22 - B 12 ) + (A 21 + A 22 ) x P n : 

/ (2M A - m A )(M B - m B ) + 2M A M B \ 
Vx e U 2 , \x\ < max (-2m A + M A )(M B - m B ) - 2m A M B \k/2 j 
\ (-2m A + M A )(M B - m B ) - 2M A m B J 

(27) 

Now 2M A -m A - (-2m A + M A ) = M A + m A > and < -m A < M A , so 
the 27 simplifies into Va; e f7 2 , \x\ < (2M A - m A )(M B - m B ) + 2M A M B . 



M(P 6 )-M(U 2 ) > (2M A -m A )(2M B -m B )-(2M A -m A )(M B -m B ) 
-2M A M B 
= (2M A - m A )(M B ) - 2M A M B 
= -m A M B > 



40 



Prime Field Linear Algebra 



J-G. dumas, P. Giorgi, C. Fernet 



• For U 3 = A 22 x (B 22 - B 12 ) + (A 21 + A 22 ) x B n + [3(C 22 - C 12 ): with the 
same argument 

/ (M A (M B - m B ) + 2M A M B )k/V + \0\{M C - m c ) \ 
Vx e U 3 , \x\ < max {M A {M B - m B ) - 2m A M B )k/2^ + \0\{M C - m c ) 

\ (M A (M B - m B ) - 2M A m B )k/V + \0\{M C - m c ) ) 

The max is always equal to its first argument, and since > 1, < 
M A — m A and Mc — mc < M B — m B , we have: 

\x\ < (M A (M B -m B ) + 2M A M B )k/y + (3(M C -m c ) 

< (2M A - m A )(M B - m B ) + 2M A M B )kj2 3 

< M(U2) < M(P6) 

• For Ui = (A 2 i + A 22 ) x B 22 + An x (B 12 - B 22 ) + 0C\ 2 : with the same 
argument as for U 3 , 

Vx e Ui, \x\ < {M A {M B - m B ) + 2M A M B )k/2 j + \0\M C 
Since Mc < M B — m B , —m A < M A and —m B < M B , we have 
M(Ui) < M{U 3 ) < M(P 6 ). 

Finally M>£'* Ma = M(P 6 )^r - vf +1 vf +1 ^, and IH j+1 is satis- 
fied. 

For the initialization of the induction (j = 1), the products of the blocks are 
done by the classical algorithm. From (25) and (26), one gets: 

M A M B k/2 
M A M B k/2 + \f3\Mc 
2{M A - m A )M B k/2 
2M A (M B - m B )k/2 + \/3\{2M c - m c ) 
2Ma(M b - m B )k/2 + \p\Mc 
(2M A - m A )(2M B - m B )k/2 
{M A - m A ){M B - m B )k/2 + \/3\{M c - m c ) 
{2M A - m A ){M B - m B )k/2 + 2M A M B k/2 
M A (M B ~ m B )k/2 + 2M A M B k/2 + \/3\{M c - m c ) 
2M A M B k/2 + M A {M B - m B )k/2 + \(3\M C 

Again, we will prove that M^ MA mB MB mc Mc (P 6 ) reaches the highest value, 
using invariants of section A. 3, and the fact that \0\ < M A , M B and k>2. 
It is straightforward for Pi and P 2 . 



M„ 
M„ 
M, 
M, 
M, 

M, 
M 
M 



l.k 



AI 



l.k 

m A 

l,k 
m A 

l,k 
m A 

l,k 
m A 

l,fc 
m A 

l,fc 
m A 

l,fc 
m A 



,M A ,m B 
,M A ,m B 
,M A ,m B 
,M A ,m B 
,M A ,m B 
,M A ,m B 
,M A ,m B 
M A ,m B , 
M A ,m B . 
M A ,m B , 



M B .mQ 
M B .rriQ 
M B ,m c 
M B < m C 
M B ,m c 
M B ,mc 
M B ,m c 
M B: m c 
M B ,m c 
M B ,m c 



,M C (-P2) 
,M C (P3) 
,M c i P i) 
,M C {PS) 

, Mc (Pe) 

,M C ( P ?) 
,M C (U2) 
,M C (U 3 ) 
,M C (U4) 
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• For P 3 : 

M^,...(P 6 )-M^,...(P 3 ) = ((2M A - m A )(2M B - m B ) - 2(M A - m A )M B )k/2 

= (2M A M B k - (2M A - m A )m B )k/2 > 

• For P 4 : Since -|/3|(2M C - m c ) > -M A (2M B - m B ), we have 

M^,...(P6)-M^,...(P 4 ) = {{2M A -m A ){2M B -m B )-2M A {M B -m B ))k/2 

-\(3\(M c -2m c ) 

> (M A -m A )(2M B -m B )-2M A (M B -m B ) 
= m A (m B -2M B )>Q 

• For P 5 : M mA MA mg MB mc Mc (P 5 )<M mA MA mB MB mc Mc (P4) 

• For P 7 : 

M^,...(P 6 )-M^,...(P 7 ) = ((2M A -mA)(2M B -m fl ) 

-(M A - m A )(M B - m B )k/2 - \/3\(M c - m c ) 

> M A (2M B - m B ) + (M A - m A )M B - M A (M B - m s ) 

> (2M A - m A )M B > 



• For £/ 2 , ^4: using the same argument as for the case of arbitrary j. 
IH\ is then satisfied. 



A. 5 Case of an arbitrary k 

Let I be such that 2 l d < k < 2 l (d+ 1) ( d = [^-J ). A dynamic peeling technique 
[31] is used to deal with odd dimensions: at each recursive level, the largest 
blocks with even dimensions at the top left hand corner of the input matrices 
are multiplied using Winograd's algorithm. Then an optional rank 1 update is 
applied, with the odd dimensions. 

These updates are using matrix-vector products, dot products and tensor 
products. Every intermediate result during these computations are therefore 
bounded in absolute value by kM A M B + \(3\M C < (k + l)M A M B 

We show now that this bound is always under the one of Winograd's algo- 
rithm. 

V/ > 1 2 l (d + 1)M A M B < vfvj 3 

(since (k + l)M A M B < 2 l (d + l)M A M B ). 

• Fori = 1, the inequation is satisfied: 2M A M B {d+l) < (2M A -m A )(2M B - 
m B )d (since d > 1) 
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• Let us suppose that it is satisfied for Z > 1 and prove it for I + 1: 
k 



v i+i v i+i 



2 i+i 



= [{2M A - m A )ti + {2m A - M A )s{\ 

x [{2Mb - m B )ti + {2m B - M B )si]d 
> 2[M A ti + m A si] [M B ti + m B si]2d 
k 



2' 



> 2(2 l M A M B (2d+l)) 

> 2 l+1 M A M B {d+l)) 

By induction, the bound of section A. 4 is valid for any k. 



A. 6 Optimality of the bound 

We simply build a sequence of square matrices Ai and Bi of order 2 l for which I 
recursive calls to Winograd's algorithm will involve intermediate results equals 
to the bound. 

Let and be recursively defined as follows: 



A l 



Ai+i 



m A 
M A M A 

Ai 



Bi 



B. 



l+i 



M B m B 
M B 



Bi Bi 
Bi 



where A t j = M A + m A — Aij and £?jj = M B + m B — Bij. 

Since at each recursive level, the computation of P6 = (A%i + A22 — An) x 
(B22 + B11 —B12) involves the largest possible intermediate values, let us define: 



S(Ai) = {Ai) 2 .i + {Ai) 2 .2 - {Ai) 1A = 2A i _ 1 - = 3^_i - J ; _i 

where is the square matrix of order 2 k whose coefficients are all equals to 
M A + m A . 

Moreover S(Jk) — Jk-i- Thus, applying P 6 I times recursively, since S is 
linear: 

S(S(. . . (S(Ai)))) = S l (Ai) = S^SiA,) - (j2 3 " J J i 

\fe=o / 

Then S(A\) — 2M A — m A and J\ — M A + m A imply: 

S l (Ai) = 3 l - 1 (2M A - m A ) - ^—^-(M A + m A ) = i±l 
The same holds for Bf. 

S\Bi) = — ^——M B + —^m B 



nr 1 - 3' 

M A + ——m A . 
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The order of Ai and P>i is k = 2 l , so [Ji-J = 1. Therefore, the computation 
of Ai x Bi with I recursive levels of Winograd's algorithm involves intermediate 
values equals to vf'vf 1 [^-J . This proves the optimality of the bound. 

Note that this bound is unchanged for computations of the type A x B+f3C. 
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